In [ ]:
## main packages
import requests, pandas as pd, numpy as np, geopandas as gpd, json
In [ ]:
## packages for visualisations
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import altair as alt
In [ ]:
## packages for modelling
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor     # used in 4.2
from scipy.stats import pearsonr    # used in 4.2 for correlations

Run on an M1 MacBook in a Conda environment with Python 3.9.15 (issues were encountered with Python >3.10)

  • find package + dependency versions

Step 1 - Crawl a real-world dataset¶

This project explores crime patterns in London using the freely available Crime API published by the UK police.

Specifically, the crime data is downloaded from the https://data.police.uk/api/crimes-street/ endpoint, which allows querying by crime category, location, and month.

'all-crime' can be passed as a category to retrieve data for all categories. Locations can be specified as a single coordinate pair, which returns crime incidents within a 1-mile radius, or as a set of coordinate pairs defining a polygon within which crimes are returned. The API holds data from the last 3 years, with results aggregated by month. As of 03/12/2022, the period available is October 2019 -> September 2022.
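
For illustration, a single-point query (returning crimes within a 1-mile radius) can be sketched as below. The coordinates here are arbitrary central-London values, not ones used in this notebook:

```python
from urllib.parse import urlencode

# base endpoint for the all-crime category; a lat/lng pair returns a 1-mile radius
base = 'https://data.police.uk/api/crimes-street/all-crime'
params = {'lat': 51.5074, 'lng': -0.1278, 'date': '2022-09'}

# requests would build an equivalent query string from a params dict
url = base + '?' + urlencode(params)
print(url)
```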

There are some limitations to consider:

  • error code 503 - returned if a query matches more than 10,000 results
  • error code 400 - API calls using GET requests are limited to 4094 characters, so for anything longer use a POST request instead
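
The GET/POST decision can be sketched with a small helper; `needs_post` is a hypothetical name for illustration, not part of the API:

```python
def needs_post(poly, date, limit=4094):
    """Return True when the full GET URL would exceed the character limit,
    meaning the query should be sent as a POST request instead."""
    base = 'https://data.police.uk/api/crimes-street/all-crime'
    url = f'{base}?poly={poly}&date={date}'
    return len(url) > limit

# a short polygon fits in a GET; a borough-sized one typically will not
print(needs_post('51.5,-0.1:51.6,-0.1:51.6,-0.2', '2022-09'))   # False
```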

To allow for easier categorisation and to stay within the results limit, the API is called for multiple areas and the results are combined into a main Greater London dataset. Results by borough are small enough to stay within the limit, but the API could equally be called for smaller areas, such as wards, MSOAs or LSOAs.

Specifying these areas requires a set of coordinates making up each area's perimeter, so the data gathering follows the process:

1. Find coordinates for London area
2. Format into lat/lng pairs akin to api spec and retrieve crime data
3. Repeat for multiple months to create time-series
4. Repeat steps 1-3 for each area

1.1 - Finding location data¶

Gathering location data can be near-automated by selecting a commonly used geographical hierarchy. For example, some commonly used hierarchies are Local Authority Districts (i.e. London Boroughs), Electoral Wards, Middle layer Super Output Areas (MSOA, around 7,500 inhabitants), Lower layer Super Output Areas (LSOA, around 1,500 inhabitants) and Output Areas (OA, around 300 inhabitants).

Shapefiles for mapping are easily available from a variety of sources and can be used to extract the coordinates for the desired areas. These are generally available at a national level, but pre-filtered files can often be found for important areas.

ArcGIS hosts a huge database of map files, most of which can be manually downloaded in multiple file formats or accessed through their api.

Using a GET request, the 'borough boundaries' dataset can be downloaded in GeoJSON format - a version of JSON for storing geographic data structures. Then, using the geopandas Python module, this is loaded into a GeoDataFrame - a pandas DataFrame with a geometry column. Each row represents a geometric object, for instance a London borough, so any accompanying information in the mapping file can be easily explored.

In [ ]:
boroughs = requests.get('https://services.arcgis.com/drifeOPKLpgnJ8Qa/arcgis/rest/services/borough boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson')
df_boroughs = gpd.GeoDataFrame.from_features(boroughs.json())

## borough boundaries also available in external_data folder in case dataset/API becomes unavailable through that link.
# df_boroughs = gpd.read_file('external_data/Borough_Boundaries.geojson')
In [ ]:
df_boroughs.tail()
Out[ ]:
geometry FID ogc_fid name gss_code hectares nonld_area ons_inner sub_2011 Shape__Area Shape__Length
28 POLYGON ((-0.09764 51.57365, -0.09753 51.57368... 29 29 Hackney E09000012 1904.902 0.000 T East 4.921020e+07 40792.587675
29 POLYGON ((-0.14239 51.56912, -0.14239 51.56928... 30 30 Haringey E09000014 2959.837 0.000 T North 7.659002e+07 46679.448320
30 POLYGON ((0.02907 51.49609, 0.02780 51.49602, ... 31 31 Newham E09000025 3857.806 237.637 T East 9.954356e+07 51121.484382
31 POLYGON ((0.09973 51.51190, 0.09976 51.51358, ... 32 32 Barking and Dagenham E09000002 3779.934 169.150 F East 9.760414e+07 59416.221750
32 POLYGON ((-0.11158 51.51534, -0.11184 51.51580... 33 33 City of London E09000001 314.942 24.546 T Central 8.122667e+06 15417.507853

Each geometry object contains a polygon of coordinates, along with the relevant names and codes. So to obtain coordinates, the polygon corresponding to each area must be extracted and cleaned into the correct format.

Some columns can also be extracted now to add as features to the main data later.

In [ ]:
## GeoDataFrame so convert GDF columns to lists first then create columns in new DF
b_names = df_boroughs['name']
b_area = df_boroughs['hectares']
b_inner = df_boroughs['ons_inner']
b_compass = df_boroughs['sub_2011']
In [ ]:
borough_feat = pd.DataFrame()
borough_feat['Borough'] = b_names
borough_feat['Hectares'] = b_area
borough_feat['Inner'] = b_inner
borough_feat['Area'] = b_compass
In [ ]:
## replace t and f with True and False
borough_feat['Inner'] = (borough_feat['Inner'] == 'T' )
In [ ]:
borough_feat.tail()
Out[ ]:
Borough Hectares Inner Area
28 Hackney 1904.902 True East
29 Haringey 2959.837 True North
30 Newham 3857.806 True East
31 Barking and Dagenham 3779.934 False East
32 City of London 314.942 True Central

1.2 - Extracting crime data¶

In [ ]:
## create a list of all the area names to iterate through
areas = df_boroughs['name'].tolist()

For each area in list:

  • extract polygon coordinates
  • format
  • call api
  • format results
  • append to any previous results
In [ ]:
## define function that extracts polygon coordinates from dataframe

def get_coords(df):
    ## extract coordinates: 1. convert into dictionary, 2. extract coordinates into numpy array
    geo_dict = json.loads(df.to_json())       # avoid shadowing the builtin dict
    coords = np.array(geo_dict['features'][0]['geometry']['coordinates'])

    ## round coordinates to 6dp - roughly 0.11m precision (improves api performance)
    coords = np.round(coords, decimals=6)     # np.round_ is deprecated

    ## build polygon string by appending coordinates in reverse order (lat,lon:lat,lon:lat,lon...)
    # first coordinate pair not preceded by colon
    poly = 'poly=' + str(coords[0][0][1]) + ',' + str(coords[0][0][0])     # convert to string so values can be concatenated

    for n in range(1, coords.shape[1]):        ## iterate from 2nd coordinate pair through to last coordinate pair
        poly = poly + ':' + str(coords[0][n][1]) + ',' + str(coords[0][n][0])

    return poly
In [ ]:
# create empty dataframe for merging all data
df_iter = pd.DataFrame()

## main function for retrieving data
def crime_api(df_iter, date, areas):
    for a in areas:
        ## Filter by area name into new dataframe
        df_b = df_boroughs[df_boroughs['name'] == a]

        ## len(poly) exceeds character limit so call api with post request
        # base url endpoint for all crimes data
        base = 'https://data.police.uk/api/crimes-street/all-crime?'
        poly = get_coords(df_b) # call function to prep coordinates

        ## call api and raise any status errors
        r = requests.post(base, data={'poly': poly, 'date': date})
        r.raise_for_status()

        ## read array into dataframe
        df = pd.json_normalize(r.json())
        df['Borough'] = a       # add new column with borough identifier

        ## vertically join new data with any previous data
        df_iter = pd.concat([df_iter, df], ignore_index=True)
    return df_iter
In [ ]:
### **** SET PARAMETERS ****

# set start and end dates in format: YYYY-MM, must be string.
min_month = '2022-03'   ## started at 2019-10
max_month = '2022-09'

# period_range function returns a fixed step index between dates specified as monthly frequency
date_range = pd.period_range(min_month, max_month, freq='M')
# convert period_index into a list of dates that can be iterated on
date_list = list(date_range.astype(str))

## create empty dataframe for merging all data
# df_main = pd.DataFrame()      # ** uncomment if first run
# iterate through each date in period index, extracting into string data format
for date in date_list:
    # assign updated dataframe on each iteration
    df_main = pd.concat([df_main, crime_api(df_iter, date, areas)], ignore_index=True)

Note: each borough takes roughly 70-100 seconds -> est. 45-60 minutes for an entire run¶

-> For testing use code below¶

When running the full download, a 500 server error code was sometimes returned. It seemed random and unrelated to any API limits, so it is believed to be an issue on the API's end.

The frequency of this was reduced greatly by shortening coordinates to 6dp.

Workaround: since df_main is only updated once monthly data for all boroughs has been retrieved, if this error is returned before a full run is complete, the code can be restarted with min_month set to the month after the most recent data in df_main.
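
One way to make such restarts less manual is a small retry wrapper for transient errors. This is only a sketch (the `with_retries` helper is hypothetical, not code used in this notebook):

```python
import time

def with_retries(fetch, max_tries=3, backoff=2.0):
    """Call fetch(), retrying on any exception (e.g. a transient HTTP 500),
    doubling the wait between attempts; re-raise after the final failure."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# usage sketch:
# with_retries(lambda: requests.post(base, data={'poly': poly, 'date': date}))
```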

In [ ]:
## ** TEST CODE **
# proof of concept: 2 areas, 1 time period

areas = ['Bromley', 'Ealing']

date = '2020-05'
df_test = pd.DataFrame()
df_test = pd.concat([df_test, crime_api(df_iter, date, areas)], ignore_index=True)
In [ ]:
df_test.sample(5)
Out[ ]:
category location_type context outcome_status persistent_id id location_subtype month location.latitude location.street.id location.street.name location.longitude outcome_status.category outcome_status.date Borough
1784 anti-social-behaviour Force NaN 83912932 2020-05 51.380602 928475 On or near Eresby Drive -0.026893 NaN NaN Bromley
5606 bicycle-theft Force NaN e3ede68b0a7682a0f664a085494d175b58eed1f452cc92... 83982065 2020-05 51.522892 959657 On or near Montpelier Road -0.302663 Investigation complete; no suspect identified 2020-05 Ealing
2221 drugs Force NaN de8f36a7278f2d1ed21586c15802acb3ac17ce3a1b6d33... 83982961 2020-05 51.392407 931158 On or near Petrol Station 0.002407 Court result unavailable 2020-11 Bromley
1593 anti-social-behaviour Force NaN 83880433 2020-05 51.401006 931643 On or near Saxville Road 0.108425 NaN NaN Bromley
4043 anti-social-behaviour Force NaN 83891351 2020-05 51.512141 958375 On or near Uxbridge Road The Broadway -0.383118 NaN NaN Ealing

1.3 - Download crime data with missing locations¶

Some crimes cannot be mapped to a location, for instance if the victim cannot recall where the crime took place. Crimes missing this location data can be returned by police force, again at a monthly frequency.

This could hinder any targeted geographical analysis, so it is worth comparing against the located crime data to ensure it represents a relatively small share of the overall data.
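
Using the row counts that appear later in this notebook (3,386,731 located crimes and 44,406 without a location), the unlocated share does indeed work out to be small:

```python
# row counts taken from the located and no-location datasets in this notebook
located, unlocated = 3_386_731, 44_406

share = unlocated / (located + unlocated)
print(f'{share:.2%}')   # ~1.29% of incidents have no location
```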

https://data.police.uk/api/crimes-no-location?category=all-crime&force=metropolitan

In [ ]:
# set start and end dates in format: YYYY-MM, must be string.
min_month = '2019-10'   ## starts from 2019-10 as of 03/12/22. if error code 404 is returned it is likely because this month is unavailable now (shifting 3 year window, so use later month) 
max_month = '2022-09'

months_pr = pd.period_range(min_month, max_month, freq='M')        # fill in monthly dates
months = list(months_pr.astype(str))        # convert period_range into a list of dates that can be iterated on
forces = ['metropolitan', 'city-of-london']     # list of London police forces
missing = pd.DataFrame()     # empty df for merging data

# loop over dates
for month in months:
    for force in forces:
        base = 'https://data.police.uk/api/crimes-no-location?category=all-crime&force=%s' % (force)
        ## call api and raise any status errors
        r = requests.post(base, data={'date': month})
        r.raise_for_status()

        df_temp = pd.json_normalize(r.json())      # read array into dataframe
        missing = pd.concat([missing, df_temp], ignore_index=True)      # vertically join new data with any previous data

# drop unnecessary columns
missing.drop(['location_type', 'location', 'context', 'outcome_status', 'persistent_id', 'id', 'location_subtype', 'outcome_status.category', 'outcome_status.date'], axis=1, inplace=True)

This calls the API for the desired date range and for both of London's police forces. Since we are only interested in the number of crimes of each type, all unnecessary columns are dropped, leaving only the 'month' and 'category' values. These can then be aggregated later by month / type.
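
The later aggregation can be sketched on a miniature stand-in for the frame (the data here is illustrative only):

```python
import pandas as pd

# hypothetical miniature of the 'missing' dataframe
missing_demo = pd.DataFrame({
    'month':    ['2019-10', '2019-10', '2019-10', '2019-11'],
    'category': ['burglary', 'burglary', 'drugs',  'drugs'],
})

# count incidents per month / category
counts = (missing_demo.groupby(['month', 'category'])
                      .size()
                      .reset_index(name='Crimes'))
print(counts)
```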

1.4 - Download crime-category types¶

In [ ]:
## call api endpoint returning crime categories (can use to format names later)
url = 'https://data.police.uk/api/crime-categories?date=2022-09'
cat_data = requests.get(url).json()
df_cat = pd.json_normalize(cat_data)

1.5 - Ready to save datasets¶

Main CSV file size is roughly 600 MB, far above the 100 MB limit for GitHub file uploads.

-> Even with xz/bz2 compression, files were roughly 110-120 MB. But the dataset contains wasted space (i.e. empty/unnecessary columns) that would be deleted later in cleaning, so these columns are dropped prior to exporting the dataset.

Files are saved in the data folder - later analysis depends on the date ranges available on the API as of 03/12/22, so don't overwrite them as that may change results.

In [ ]:
## drop unnecessary columns 

# df_main.drop(['location_type', 'context', 'outcome_status', 'persistent_id', 'location_subtype', 'location.street.id'], axis=1, inplace=True)
In [ ]:
## export downloaded data as csv file

# df_main.to_csv('data/all_crimes.csv', index = False)      # for standard csv write
# df_main.to_csv('data/all_crimes.bz2', index=False)          # for compressed csv write

# missing.to_csv('data/crimes_missing.csv', index=False)
# df_cat.to_csv('data/crime_categories.csv', index = False)

Using xz and bz2 compression brought file sizes down to 31-33 MB; however, bz2 write time was roughly 3 times faster than xz, so bz2 is used.
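
The trade-off is easy to reproduce on a small synthetic frame - pandas infers the compression codec from the file extension, so only the filename changes (sizes here will of course differ from the real dataset):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': ['some repetitive text'] * 1000})

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for ext in ('csv', 'bz2', 'xz'):
        path = os.path.join(tmp, f'demo.{ext}')
        df.to_csv(path, index=False)              # codec inferred from extension
        assert pd.read_csv(path).equals(df)       # lossless round-trip
        sizes[ext] = os.path.getsize(path)

print(sizes)   # compressed files come out much smaller than the plain csv
```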

1.6 - Other data sources¶

Borough specific data has been gathered externally:

  • deprivation data from ONS - 2019 Mapping income deprivation at a Borough and local authority level
  • population data from ONS - 2021 Census first release

Each was downloaded from source in excel format, filtered for London boroughs only, and saved as a csv in the external_data folder. These are loaded in when required.


Step 2 - Data preparation and cleaning¶

In [ ]:
crimes_raw_df = pd.read_csv('data/all_crimes.bz2')      # roughly 7s, read main dataset, pandas automatically decompresses file
In [ ]:
crimes_raw_df
Out[ ]:
category id month location.latitude location.street.name location.longitude outcome_status.category outcome_status.date Borough
0 anti-social-behaviour 78702917 2019-10 51.411857 On or near Supermarket -0.300998 NaN NaN Kingston upon Thames
1 anti-social-behaviour 78702919 2019-10 51.411857 On or near Supermarket -0.300998 NaN NaN Kingston upon Thames
2 anti-social-behaviour 78702920 2019-10 51.414177 On or near Nightclub -0.301027 NaN NaN Kingston upon Thames
3 anti-social-behaviour 78702921 2019-10 51.411260 On or near Nipper Alley -0.300761 NaN NaN Kingston upon Thames
4 anti-social-behaviour 78702922 2019-10 51.403324 On or near Bloomfield Road -0.299847 NaN NaN Kingston upon Thames
... ... ... ... ... ... ... ... ... ...
3386726 other-crime 104960363 2022-09 51.516271 On or near Aldermanbury -0.092971 Under investigation 2022-09 City of London
3386727 other-crime 104960391 2022-09 51.513631 On or near Finch Lane -0.086019 Under investigation 2022-09 City of London
3386728 other-crime 104960233 2022-09 51.517770 On or near -0.078495 Under investigation 2022-09 City of London
3386729 other-crime 104960109 2022-09 51.517656 On or near Sandy's Row -0.077563 Offender given a caution 2022-09 City of London
3386730 other-crime 104960442 2022-09 51.510559 On or near Queenhithe -0.095054 Under investigation 2022-09 City of London

3386731 rows × 9 columns

The dataset contains 3,386,731 rows, each representing a crime reported in the London area in the 36 months to September 2022. Each row details the crime type, the location type (i.e. whether jurisdiction is the normal police force or the transport police), a crime ID unique to the API, multiple location fields, and the crime outcome if any (historical data is updated regularly to match police and court outcomes).

In [ ]:
crimes_raw_df.columns
Out[ ]:
Index(['category', 'id', 'month', 'location.latitude', 'location.street.name',
       'location.longitude', 'outcome_status.category', 'outcome_status.date',
       'Borough'],
      dtype='object')
In [ ]:
## read data for crime incidents not mapped to a location, each row represents an incident.
crimes_missing = pd.read_csv('data/crimes_missing.csv')
crimes_missing
Out[ ]:
month category
0 2019-10 anti-social-behaviour
1 2019-10 anti-social-behaviour
2 2019-10 anti-social-behaviour
3 2019-10 anti-social-behaviour
4 2019-10 anti-social-behaviour
... ... ...
44401 2022-09 other-crime
44402 2022-09 other-crime
44403 2022-09 other-crime
44404 2022-09 other-crime
44405 2022-09 other-crime

44406 rows × 2 columns

2.1 Cleaning¶

In [ ]:
# rename columns 
crimes_raw_df.columns = ['Crime Category', 'Crime ID', 'Month', 'Latitude', 'Street Name', 'Longitude', 'Outcome', 'Outcome Date', 'Borough']

# format crime categories from url name to nice name, first read csv of category names
cat_df = pd.read_csv('data/crime_categories.csv', index_col = 0)
cat_map = cat_df.to_dict()     # create dictionary from dataframe (avoid shadowing the builtin dict)

crimes_df = crimes_raw_df.replace({'Crime Category': cat_map['name']})     # use the dictionary to replace all category names
crimes_missing = crimes_missing.replace({'category': cat_map['name']})     # note: this frame's column is the original lowercase 'category'
In [ ]:
# verify no duplicate crimes present, e.g. potentially from area boundaries when downloading from api
crimes_df['Crime ID'].duplicated().sum()
Out[ ]:
0

2.2 - Enrich data¶

Add features extracted from GeoJSON

In [ ]:
## merges dataframes using borough names as key
crimes_df = pd.merge(crimes_df, borough_feat, on='Borough')

Add population data, Borough Population CSV extracted from ONS 2021 Census Results first release

Can use later for calculating crime rates

In [ ]:
pop_df = pd.read_csv('external_data/Borough_pop_census2021.csv', index_col=0)

crimes_df = pd.merge(crimes_df, pop_df, on='Borough')
In [ ]:
# check column data types
crimes_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3386731 entries, 0 to 3386730
Data columns (total 14 columns):
 #   Column          Dtype         
---  ------          -----         
 0   Crime Category  object        
 1   Crime ID        int64         
 2   Month           object        
 3   Latitude        float64       
 4   Street Name     object        
 5   Longitude       float64       
 6   Outcome         object        
 7   Outcome Date    object        
 8   Borough         object        
 9   Hectares        float64       
 10  Inner           bool          
 11  Area            object        
 12  Date            datetime64[ns]
 13  Year            int64         
dtypes: bool(1), datetime64[ns](1), float64(3), int64(2), object(7)
memory usage: 365.0+ MB

Columns with dates are formatted as objects, so create a new column with datetime format and another holding only the year for later analysis.

In [ ]:
crimes_df['Date'] = pd.to_datetime(crimes_df['Month'], format='%Y-%m')  # create new column with datetime type
crimes_df['Year'] = crimes_df['Date'].dt.year   # create new column holding only relevant year value
In [ ]:
crimes_df.sample(5)
Out[ ]:
Crime Category Crime ID Month Latitude Street Name Longitude Outcome Outcome Date Borough Hectares Inner Area population Date Year
603107 Anti-social behaviour 86403184 2020-08 51.481915 On or near Caroline Place -0.430575 NaN NaN Hillingdon 11570.063 False West 305900 2020-08-01 2020
2309311 Violence and sexual offences 80600573 2020-01 51.515761 On or near Stourcliffe Street -0.162390 Investigation complete; no suspect identified 2020-03 Westminster 2203.005 True Central 204300 2020-01-01 2020
811447 Burglary 92790017 2021-05 51.561309 On or near Sudbury Hill Close -0.328500 Investigation complete; no suspect identified 2021-05 Brent 4323.270 False West 339800 2021-05-01 2021
1037258 Anti-social behaviour 88453957 2020-11 51.479830 On or near Old South Lambeth Road -0.123439 NaN NaN Lambeth 2724.940 True Central 317600 2020-11-01 2020
2524893 Violence and sexual offences 102071443 2020-03 51.517400 Holborn (lu Station) -0.120207 Status update unavailable 2020-07 Camden 2178.932 True Central 210100 2020-03-01 2020

The data straddles the Covid period, so exploring lockdown effects on crime may be interesting.

A lockdown timeline is available here - https://www.instituteforgovernment.org.uk/charts/uk-government-coronavirus-lockdowns

Considering full months:

  • Lockdown 1: April 2020 -> May 2020
  • Lockdown 2: November 2020
  • Lockdown 3: January 2021 -> March 2021

These can then be used as filters when required.

Can also define three 12-month periods

In [ ]:
L1 = pd.date_range(start='2020-04-01', end='2020-05-31' ,freq='MS')
L2 = pd.date_range(start='2020-11-01', end='2020-11-30' ,freq='MS')
L3 = pd.date_range(start='2021-01-01', end='2021-03-31' ,freq='MS')

Y1 = pd.date_range(start='2019-10-01', end='2020-09-30' ,freq='MS')
Y2 = pd.date_range(start='2020-10-01', end='2021-09-30' ,freq='MS')
Y3 = pd.date_range(start='2021-10-01', end='2022-09-30' ,freq='MS')
In [ ]:
P1 = pd.period_range(start='2019-10-01', end='2020-09-30' ,freq='M')
P2 = pd.period_range(start='2020-10-01', end='2021-09-30' ,freq='M')
P3 = pd.period_range(start='2021-10-01', end='2022-09-30' ,freq='M')
In [ ]:
## EG
crimes_df[crimes_df['Date'].isin(Y1)]

Anti-social behaviour isn't recorded in total crime stats, so a new dataframe can be created without it, keeping the full one for relevant questions.

  • incidents -> for crimes + ASB incidents
  • crimes -> for crimes only
In [ ]:
## drop anti-social behaviour
incidents = crimes_df.copy()
crimes = crimes_df[(crimes_df['Crime Category'] != 'Anti-social behaviour')].reset_index(drop=True)

Step 3 - Perform exploratory analysis¶

The main dataframes have been left unaggregated so they can be combined / grouped as necessary.

  • group by month, year etc
  • group by borough
  • group by crime type
  • any combination
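
These combinations can be sketched on a hypothetical miniature frame (illustrative data only, not drawn from the dataset):

```python
import pandas as pd

# illustrative stand-in for the unaggregated crimes frame
demo = pd.DataFrame({
    'Borough': ['Camden', 'Camden', 'Brent', 'Brent', 'Brent'],
    'Month':   ['2022-01', '2022-02', '2022-01', '2022-01', '2022-02'],
    'Crime Category': ['Burglary', 'Drugs', 'Burglary', 'Drugs', 'Drugs'],
})

# one incident per row, so size() gives incident counts for any key combination
by_borough_month = (demo.groupby(['Borough', 'Month'])
                        .size()
                        .reset_index(name='Crimes'))
print(by_borough_month)
```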
In [ ]:
crimes.sample(5)
Out[ ]:
Crime Category Crime ID Month Latitude Street Name Longitude Outcome Outcome Date Borough Hectares Inner Area population Date Year
1430494 Violence and sexual offences 102523035 2022-06 51.425407 On or near High Street -0.219138 Under investigation 2022-06 Merton 3762.466 False South 215200 2022-06-01 2022
1029118 Violence and sexual offences 99050329 2022-01 51.464911 On or near Halsbrook Road 0.036605 Investigation complete; no suspect identified 2022-03 Greenwich 5044.190 False East 289100 2022-01-01 2022
1714438 Violence and sexual offences 94043766 2021-07 51.513664 On or near Frith Street -0.131575 Investigation complete; no suspect identified 2021-07 Westminster 2203.005 True Central 204300 2021-07-01 2021
2267228 Violence and sexual offences 84607321 2020-06 51.525729 On or near Marlow Road 0.054642 Investigation complete; no suspect identified 2020-06 Newham 3857.806 True East 351100 2020-06-01 2020
806669 Vehicle crime 81095047 2020-02 51.483669 On or near John Ruskin Street -0.094404 Court result unavailable 2020-08 Southwark 2991.340 True Central 307700 2020-02-01 2020

Monthly Crime Totals¶

First, we can look at how total incidents behave across the date range.

In [ ]:
## create new dataframes with monthly crime totals, rename & add indicator columns, then vertically concat
exp1a = pd.DataFrame(incidents['Month'].value_counts()).reset_index()
exp1b = pd.DataFrame(crimes['Month'].value_counts()).reset_index()

exp1a.columns = ['Month', 'Crimes']
exp1b.columns = ['Month', 'Crimes']

exp1a['Measure'] = 'Crimes (incl. anti-social behaviour)'
exp1b['Measure'] = 'Crimes'

exp1 = pd.concat([exp1a, exp1b], ignore_index=True)
In [ ]:
alt.Chart(exp1).mark_line(point=True).encode(
    x = alt.X('Month:T', title=None, axis=alt.Axis(grid=False)),
    y = alt.Y('Crimes:Q', title=None),
    color = alt.Color('Measure:N', legend=alt.Legend(orient='bottom-right')),
    tooltip = [alt.Tooltip('Month:T', title='Month', format='%b %Y'), alt.Tooltip('Crimes:Q', format=',')]
).properties(
    width = 600,
    title = 'London: Monthly Crime Totals'
)
Out[ ]:

Crime rates are roughly consistent over the period, with sharp drops around the lockdown periods (except for anti-social behaviour) but no clear trend.

Yearly Totals¶

How does this translate to yearly totals? We will consider the three 12-month periods defined earlier.

In [ ]:
# create function to apply to DF to add column value if in certain date range

def add_period(row):
    if row['Month'] in (list(P1.astype(str))):
        return '2019-20'
    elif row['Month'] in (list(P2.astype(str))):
        return '2020-21'
    elif row['Month'] in (list(P3.astype(str))):
        return '2021-22'
In [ ]:
exp1['Period'] = exp1.apply(add_period, axis = 1)
In [ ]:
alt.Chart(exp1).mark_bar(size=40, opacity=0.7).encode(
    x = alt.X('Period:O', title=None, axis=alt.Axis(labelAngle=-30, labelOffset=30)),
    y = alt.Y('sum(Crimes):Q', stack=None, title=None, scale=alt.Scale(domain=[0, 1400000])),
    color = alt.Color('Measure:N', title=None, legend=alt.Legend(orient='top-right')),
).properties(
    width = 300,
    title = 'London: Yearly Crime Totals'
)
Out[ ]:

The grey bar represents only crimes while the grey + yellow represent all crime incidents.

The year-to-2021 and year-to-2022 totals are similar; however, the decrease in anti-social behaviour incidents masks a slight rise in crime. Since the latter two periods both contained lockdowns, no trend conclusions can be drawn yet.

Crime Categories¶

Identify which crimes are most and least common across the whole 36 month period.

In [ ]:
## Each row in main represents an incident, so counting each unique instance of a Crime Category will give total instances per category.
exp3 = pd.DataFrame(incidents['Crime Category'].value_counts()).reset_index()
exp3.columns = ['Category', 'Total Incidents']

Most frequent crime types can be visualised with a bar chart

In [ ]:
alt.Chart(exp3).mark_bar().encode(
    x = alt.X('Total Incidents:Q'),
    y = alt.Y('Category:N', sort = '-x', title=None),
    tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
    title = 'London: Total Crime Incidents (Oct-2019 to Sep-2022)'
)
Out[ ]:

From the ONS: 'Violent crime covers a wide range of offences including minor assaults (such as pushing and shoving), harassment and abuse (that result in no physical harm) through to wounding and homicide. Sexual offences include rape, sexual assault and unlawful sexual activity against adults and children, sexual grooming and indecent exposure.'

ASB shows by far the highest total incidents, as expected since it is not typically included in crime stats. Violence and sexual offences includes many sub-categories, which likely also explains why it is very high.

The possession of weapons in London is often talked about as an epidemic; however, we can see it makes up a tiny proportion of crimes, with 16,557 recorded incidents over the last 3 years.

Find dispersion among crime types¶

In [ ]:
# find total number of monthly crime incidents for each borough
exp4 = crimes.groupby(['Crime Category', 'Month'], as_index=False)['Month'].value_counts()
In [ ]:
alt.Chart(exp4).mark_boxplot(size=50, ticks=True).encode(
    x = alt.X('Crime Category:N', title=None, axis=alt.Axis(labels=False, ticks=False), scale=alt.Scale(padding=1)), 
    y = alt.Y('count:Q', title=None), 
    color = alt.Color('Crime Category:N', legend=None),
    facet = alt.Facet('Crime Category:O', columns=7, title='London: Monthly crime dispersion in the last 3 years'),
).properties(
    width=100,
    height = 150
).resolve_scale(
    y='independent',
    x='independent'
)
Out[ ]:

This boxplot shows how the monthly crime totals vary for each crime category - in the 36 month period Oct-2019 to Sep-2022.

The whiskers extend to the furthest points within 1.5 x the interquartile range from the 1st and 3rd quartiles; any points outside this range are plotted as outliers.
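
The whisker rule can be checked numerically on a small example (the monthly totals here are made up for illustration):

```python
import numpy as np

# hypothetical monthly totals for one crime category
totals = np.array([90, 95, 100, 102, 105, 110, 180])

q1, q3 = np.percentile(totals, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# anything outside the fences would be drawn as an outlier point
outliers = totals[(totals < lower) | (totals > upper)]
print(lower, upper, outliers)   # 82.5 122.5 [180]
```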

Theft from the person and other theft show the biggest dispersion, likely due to the lockdowns. Similarly, bicycle theft shows a big dispersion weighted upwards, suggesting a few very high monthly values.

How does crime vary across London?¶

To visualise this, we can use the initial London borough map GeoJSON combined with stats aggregated at the borough level, then display a choropleth map to identify any high-crime clusters.

In [ ]:
## counts unique instances of borough name, ie number of crimes reported by borough across the whole period.
borough_freq = pd.DataFrame(crimes['Borough'].value_counts()).reset_index()
borough_freq.columns = ['Borough', 'Total Incidents']
In [ ]:
# uses the Altair package to plot a choropleth map of crime incidents by London Borough. 
# the initial GeoDataFrame from the ArcGIS API is used as a base map, with a data lookup to get crime incidents from the dataframe.

alt.Chart(df_boroughs).mark_geoshape().encode(
    color='Total Incidents:Q',
    tooltip= [alt.Tooltip('name:N', title='Borough'), alt.Tooltip('Total Incidents:Q', format=',')]
).transform_lookup(
    lookup='name',
    from_=alt.LookupData(borough_freq, 'Borough', ['Total Incidents'])
).project(
    type='mercator'
).properties(
    width=500,
    height=300,
    title='London: Mapping 3 years of crime'
)
Out[ ]:

This generally shows higher crime totals in inner London, with the City of London an outlier due to its smaller area and population. To better compare across boroughs it would be beneficial to combine with population data and calculate crime rates.

Finding crime rates¶

Borough Population CSV extracted from ONS 2021 Census Results first release.

Crime rates are typically quoted as yearly rates, so only 2021 crime data will be extracted.

In [ ]:
## create boolean variable for when year is 2021, then apply this to main df as a filter, leaving only crimes reported in 2021.
# is_2021 = crimes_df['Year'] == 2021
crimes_2021 = crimes[crimes['Year'] == 2021]
crimes_2021.sample(3)
Out[ ]:
Crime Category Crime ID Month Latitude Street Name Longitude Outcome Outcome Date Borough Hectares Inner Area population Date Year
1411524 Vehicle crime 89907023 2021-01 51.415044 On or near Garden Avenue -0.150958 Status update unavailable 2021-05 Merton 3762.466 False South 215200 2021-01-01 2021
1004630 Other theft 90704636 2021-02 51.488995 On or near Parking Area 0.069426 Investigation complete; no suspect identified 2022-01 Greenwich 5044.190 False East 289100 2021-02-01 2021
838831 Shoplifting 91448735 2021-03 51.471558 On or near Cerise Road -0.068112 Investigation complete; no suspect identified 2021-03 Southwark 2991.340 True Central 307700 2021-03-01 2021
In [ ]:
## apply to find incidents for each borough in 2021
borough_freq_2021 = pd.DataFrame(crimes_2021['Borough'].value_counts()).reset_index()
borough_freq_2021.columns = ['Borough', 'Total Incidents']
borough_freq_2021.head()
Out[ ]:
Borough Total Incidents
0 Westminster 49294
1 Newham 32807
2 Croydon 31917
3 Tower Hamlets 31569
4 Lambeth 31526

Now import the population data, merge the files using the shared borough names, and calculate crime rates per 1,000 population.

The City of London's resident population is only 8,600, but its working/daytime population is as high as 500,000, which would drastically inflate its crime rate, so it is dropped from the data.

In [ ]:
# Borough Population CSV extracted from ONS 2021 Census Results first release
pop_df = pd.read_csv('/Users/joshhellings/Documents/OneDrive - University of Bristol/FinTech/SDPA/CW_part2/external_data/Borough_pop_census2021.csv', index_col = 0)
crime_rates = pd.merge(borough_freq_2021, pop_df, on=['Borough'])     # merge on shared borough names to add the population column
crime_rates['Crime Rate'] = (1000*crime_rates['Total Incidents']) / crime_rates['population']
crime_rates = crime_rates.iloc[:-1,:].copy()     ## drop last row (City of London, which has the fewest incidents so sorts last)
In [ ]:
# as before, but now using crime rate for 2021 rather than total incidents.
alt.Chart(df_boroughs).mark_geoshape().encode(
    color='Crime Rate:Q',
    tooltip= [alt.Tooltip('name:N', title='Borough'), alt.Tooltip('Crime Rate:Q', format='d', title='Crime Rate per 1,000 people')]
).transform_lookup(
    lookup='name',
    from_=alt.LookupData(crime_rates, 'Borough', ['Crime Rate'])
).project(
    type='mercator'
).properties(
    width=500,
    height=300,
    title='London: 2021 Crime rates by borough (per 1,000 people)'
)
Out[ ]:

Crime rates now clearly higher for inner London, with Westminster by far the highest. The high tourist population is generally blamed for the crime rates in the Borough of Westminster. Also, this visualisation uses the newest 2021 Census borough population, which for Westminster has actually decreased since 2011, thereby inflating its crime rate further.

Is this reflected in the type of crimes committed?

Westminster crime distribution¶

In [ ]:
## count Westminster crime incidents by category across the full data period
Westmin_df = pd.DataFrame(crimes[crimes['Borough'] == 'Westminster']['Crime Category'].value_counts()).reset_index()
Westmin_df.columns = ['Category', 'Total Incidents'] 

Find crimes per type for the whole of London

In [ ]:
filt = crimes[crimes['Year'] == 2021]
Lon2021 = pd.DataFrame(filt['Crime Category'].value_counts()).reset_index()
Lon2021.columns = ['Category', 'Total Incidents']

Plot crime incidents by type in Westminster in 2021

In [ ]:
Westmin = alt.Chart(Westmin_df).mark_bar().encode(
    x = alt.X('Total Incidents:Q'),
    y = alt.Y('Category:N', sort = '-x', title=None),
    tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
    title = 'Westminster: Crime in 2021'
)

London_av = alt.Chart(Lon2021).mark_bar().encode(
    x = alt.X('Total Incidents:Q'),
    y = alt.Y('Category:N', sort = '-x', title=None),
    tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
    title = 'London: Crime in 2021'
)

Westmin | London_av
Out[ ]:

Crime outcomes¶

Lastly, can look at crime outcomes

In [ ]:
crimes['Outcome'].value_counts(normalize=True).round(3)
Out[ ]:
Investigation complete; no suspect identified          0.630
Status update unavailable                              0.201
Under investigation                                    0.076
Court result unavailable                               0.039
Local resolution                                       0.031
Offender given penalty notice                          0.007
Offender given a caution                               0.007
Awaiting court outcome                                 0.007
Unable to prosecute suspect                            0.001
Offender given a drugs possession warning              0.000
Formal action is not in the public interest            0.000
Action to be taken by another organisation             0.000
Further investigation is not in the public interest    0.000
Suspect charged as part of another case                0.000
Further action is not in the public interest           0.000
Name: Outcome, dtype: float64

By far, most crimes result in no suspect being identified.


Step 4 - Ask questions of your data¶

4.1. Are crime rates decreasing?¶

In [ ]:
## create new dataframes with monthly crime totals, rename & add indicator columns, then vertically concat
q1_1a = pd.DataFrame(incidents['Month'].value_counts()).reset_index()
q1_1b = pd.DataFrame(crimes['Month'].value_counts()).reset_index()

q1_1a.columns = ['Month', 'Crimes']
q1_1b.columns = ['Month', 'Crimes']

q1_1a['Measure'] = 'Crimes (incl. anti-social behaviour)'
q1_1b['Measure'] = 'Crimes'

q1_1 = pd.concat([q1_1a, q1_1b], ignore_index=True)
In [ ]:
alt.Chart(q1_1).mark_line(point=True).encode(
    x = alt.X('Month:T', title=None, axis=alt.Axis(grid=False)),
    y = alt.Y('Crimes:Q', title=None),
    color = alt.Color('Measure:N', legend=alt.Legend(orient='bottom-right')),
    tooltip = [alt.Tooltip('Month:T', title='Month', format='%b %Y'), alt.Tooltip('Crimes:Q', format=',')]
).properties(
    width = 600,
    title = 'London: Monthly Crime Totals'
)
Out[ ]:

No obvious long-term trend; crime and anti-social behaviour incidents track each other except for March to June 2020.

Since the middle of 2021, crime totals have been mostly stable around 70,000, with all incidents stable between 80,000 and 100,000. Total incidents (including anti-social behaviour) spiked in April and May of 2020, the first two full months of nationally imposed lockdowns.

Incidents of anti-social behaviour clearly spiked in the first few months of nationally imposed lockdown, likely because covid breaches were typically reported as ASB.

Have the types of crimes being committed changed?¶

To see how crime incidents have changed over the data period, we can use a similar multi-line chart as before with a couple key changes:

  • calculate rolling window mean on crime counts to remove some expected noise between months
  • index at the first time period so can better compare how crime patterns are developing

i) iterate through each crime type and calculate a three-month rolling mean for crimes of that type, such that the calculated 2020-01 value is the mean of the 2019-11, 2019-12, and 2020-01 monthly totals
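As a toy illustration of the rolling-window behaviour (the first two values are undefined, which is why those rows are dropped below):

```python
import pandas as pd

# 3-month rolling mean: each value is the mean of itself and the two before it
s = pd.Series([10, 20, 30, 40])
print(s.rolling(3).mean().tolist())   # [nan, nan, 20.0, 30.0]
```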

In [ ]:
## calculate monthly crime total per crime type
cat_monthly = crimes.groupby(['Crime Category', 'Month'], as_index=False)['Month'].value_counts()

## create list of unique crime types to iterate over
types = list(cat_monthly['Crime Category'].unique())

df_roll = pd.DataFrame()
for t in types:
    subset = cat_monthly[(cat_monthly['Crime Category'] == t)].reset_index(drop=True)
    subset['Crimes (3-month average)'] = subset['count'].rolling(3).mean()   ## rolling mean over the numeric count column
    subset = subset.iloc[2:,:].copy()   # drop first two rows (rolling mean undefined there)
    df_roll = pd.concat([df_roll, subset], ignore_index=True)

df_roll
Out[ ]:
Crime Category Month count Crimes (3-month average)
0 Bicycle theft 2019-12 1030 1363.666667
1 Bicycle theft 2020-01 1251 1205.333333
2 Bicycle theft 2020-02 1123 1134.666667
3 Bicycle theft 2020-03 1128 1167.333333
4 Bicycle theft 2020-04 1080 1110.333333
... ... ... ... ...
437 Violence and sexual offences 2022-05 23025 21774.000000
438 Violence and sexual offences 2022-06 21682 21671.666667
439 Violence and sexual offences 2022-07 22883 22530.000000
440 Violence and sexual offences 2022-08 21647 22070.666667
441 Violence and sexual offences 2022-09 20513 21681.000000

442 rows × 4 columns

ii) Index values, again by taking a subset for each crime type, then divide every observation by the first value
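The indexing convention can be sketched on a toy series: every value is divided by the first and scaled to 100, so the series starts at 100 and later values read as percentages of the starting level:

```python
import pandas as pd

s = pd.Series([50.0, 100.0, 25.0])
indexed = (s / s.iloc[0]) * 100   # first value becomes the base = 100
print(indexed.tolist())   # [100.0, 200.0, 50.0]
```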

In [ ]:
## calculate index for crime count = 100 at t1
index = []
for t in types:
    subset = df_roll[(df_roll['Crime Category'] == t)].reset_index(drop=True)
    base = subset['Crimes (3-month average)'].iloc[0]      ## first value is the divisor (index base = 100)
    subset['Rolling crime average'] = (subset['Crimes (3-month average)'] / base) * 100
    index.extend(list(subset['Rolling crime average']))

df_roll['Index'] = index    # add index to original DF

df_roll.head()
Out[ ]:
Crime Category Month count Crimes (3-month average) Index
0 Bicycle theft 2019-12 1030 1363.666667 100.000000
1 Bicycle theft 2020-01 1251 1205.333333 88.389147
2 Bicycle theft 2020-02 1123 1134.666667 83.207040
3 Bicycle theft 2020-03 1128 1167.333333 85.602542
4 Bicycle theft 2020-04 1080 1110.333333 81.422635

Plot all index results on a multi-line chart

In [ ]:
alt.Chart(df_roll).mark_line(point='transparent').encode(
    x = alt.X('Month:O', title=None),
    y = alt.Y('Index:Q', title=None),
    color = alt.Color('Crime Category:N')
).properties(
    title = 'London: Monthly crime incidents (3 month rolling average)',
    width = 600,
    height = 300
)
Out[ ]:

Interestingly, the crimes that diverged during the first lockdown period have remained mostly separated - i.e., mostly stay on either side of y=95.

This also shows some seasonality, most clearly evident with bicycle theft but also noticeable with public order - it may be true of other categories too, though lockdowns are likely to produce similar-looking patterns.

Bike theft in the summer of 2020 was around 50% higher than in subsequent years, likely a result of cycling becoming especially popular during and after the first lockdown, with new bikes scarcely available and prices far higher than normal.

  • possible extension: group related categories (e.g. combine all theft types)
  • or add an interactive highlight selection

How have crime incidents varied on a yearly scale?¶

Since we have three years of data but not three full calendar years, we can use three 12-month periods instead.
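The `add_period` function itself is defined in part 3 (not shown here); a minimal re-sketch consistent with the sample output below, assuming a 'YYYY-MM' `Month` column and October-to-September periods, might look like:

```python
def add_period(row):
    """Map a 'YYYY-MM' Month string to its Oct-Sep 12-month period label.

    Hypothetical re-sketch of the part 3 helper, e.g. '2021-06' -> '2020-21'.
    """
    year, month = (int(x) for x in row['Month'].split('-'))
    start = year if month >= 10 else year - 1     # periods start in October
    return f'{start}-{str(start + 1)[-2:]}'
```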

In [ ]:
## use function from part 3 to add new column with relevant yearly period (P1, P2, P3)
cat_monthly['Year'] = cat_monthly.apply(add_period, axis=1)
cat_monthly.sample(3)
Out[ ]:
Crime Category Month count Year
56 Burglary 2021-06 4232 2020-21
327 Shoplifting 2020-01 4008 2019-20
266 Public order 2020-12 4106 2020-21

Now can find the average monthly number of crime incidents across each 12-month period.

In [ ]:
cat_yearly = cat_monthly.groupby(['Crime Category', 'Year'], as_index=False)['count'].mean().round(0)
cat_yearly.sample(3)
Out[ ]:
Crime Category Year count
2 Bicycle theft 2021-22 1630.0
29 Shoplifting 2021-22 3111.0
25 Robbery 2020-21 1906.0

All the data is now in a dataframe so can be displayed. For this we can use a faceted line plot with points for each of the yearly periods.

In [ ]:
alt.Chart(cat_yearly).mark_line(point=True).encode(
    x = alt.X('Year:O', title=None, axis=alt.Axis(labelAngle=-40, labelOffset=10)),
    y = alt.Y('count:Q', title=None),
    color = alt.Color('Crime Category:N'),
    tooltip = [alt.Tooltip('Crime Category:N'), alt.Tooltip('Year:N'), alt.Tooltip('count:Q', title='Average Monthly Incidents', format=',d')],
    facet = alt.Facet('Crime Category:O', columns=7, title='London: Yearly crime incidents (averaged monthly)'),
).properties(
    width = 100,
    height = 100
).resolve_scale(
    y='independent'
)
Out[ ]:

So, when considering each category for the yearly periods (Oct through Sep), there have been some interesting changes:

Burglary and robbery have both trended down slightly, but theft from the person and other theft have increased significantly - potentially lockdown effects.

Have these changes been evenly spread among boroughs?¶

So we have shown that the relatively stable overall crime rate actually masks a lot of change between crime categories. Now we can consider whether it also masks even greater changes among boroughs.

For this, will repeat similar steps as before to find incidents across each yearly period (using yearly total now rather than average), but now also grouping by borough.

In [ ]:
bor_monthly = crimes.groupby(['Borough', 'Crime Category', 'Month'], as_index=False)['Month'].value_counts()

## use function from part 3 to add new column with relevant yearly period (P1, P2, P3)
bor_monthly['Year'] = bor_monthly.apply(add_period, axis=1)

## find total yearly crime incidents for each category in each borough 
bor_yearly = bor_monthly.groupby(['Borough', 'Crime Category', 'Year'], as_index=False)['count'].sum()

Now for each borough, can find the change between P1 and P3 in each crime type.

  • To make this easier, will first use pivot to change the data to wide form with yearly columns
In [ ]:
bor_wide = pd.pivot(bor_yearly, index=['Borough', 'Crime Category'], columns=['Year'], values='count')

bor_wide.reset_index(inplace=True)      # reset to remove multi-index
bor_wide.rename_axis(None, axis=1)      # remove name from index

# calculate percentage change in two year period
bor_wide['Two-year Change'] = (bor_wide['2021-22'] - bor_wide['2019-20']) / bor_wide['2019-20']

# drop City of London data as very sensitive to change
bor_wide = bor_wide[(bor_wide['Borough'] != 'City of London')]

Now for each category, find the greatest positive or negative changes

  • iterate through each type to get a subset of the dataframe - one row per borough
  • sort by the two-year change (descending)
  • take the 3 largest increases and 3 largest decreases
In [ ]:
df_highlow = pd.DataFrame()

for t in types:
    subset = bor_wide[(bor_wide['Crime Category'] == t)].reset_index(drop=True)
    subset = subset.sort_values(by=['Two-year Change'], ascending=False, ignore_index=True)
    subset = pd.concat([subset.head(3), subset.tail(3)], ignore_index=True)       ## take 3 highest and lowest 
    df_highlow = pd.concat([df_highlow, subset], ignore_index=True)

Plot all the results on separate bar charts, diverging scale used to show the differences

In [ ]:
alt.Chart(df_highlow).mark_bar().encode(
    y = alt.Y('Borough:N', sort='-x', title = None),
    x = alt.X('Two-year Change:Q', axis=alt.Axis(format='%'), title=None),
    color = alt.Color('Two-year Change', scale=alt.Scale(scheme='blueOrange'), legend=None),
    tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Two-year Change:Q', format='.1%'), alt.Tooltip('2021-22:Q', title='Total Incidents')],
    facet = alt.Facet('Crime Category:O', columns=5, title='London: Which boroughs have seen the biggest changes in crime in the last two years?'),
).properties(
    width = 120,
    height = 100
).resolve_scale(
    y='independent',
    x='independent',
    color='independent'
)
Out[ ]:

The chart above plots the 3 biggest positive and negative changes in total crime incidents across the two-year period: ie from 2019-20 to 2021-22.

This shows again that relatively small changes in crime, when aggregated across the whole of London, can mask much larger changes at a more local level.

Westminster topped the charts in three areas: other theft, public order, and violence and sexual offences. Potentially its high crime incidents are tied to the large day and tourist population, which dwindled in the first period due to covid lockdowns.

Burglary has fallen across every borough, suggesting a structural impact of covid, for example more people working from home, has had an effect.

Conversely, violence and sexual offences have increased in every borough.


4.2. Is there a difference between crime types of the poorest and richest boroughs?¶

...and if so, can any relationships be drawn?

The ONS publishes the English Indices of Deprivation every 3-5 years (latest available: 2019), collating multiple measures of inequality at the LAD and LSOA level. London boroughs make up London's Local Authority Districts, so this can be used to merge the datasets. As with population, this data is not especially fast moving, so we shouldn't expect any significant effects from the measurement periods not exactly aligning.

We can first filter the crime dataset to a yearly period - will use full dataset with anti-social behaviour. August 2021 -> July 2022 was chosen as the last UK covid restrictions were lifted in July 2021, so choosing this period, rather than 2021 or any earlier, should remove/reduce possibility of covid effects.

  • filter for crimes within range 2021-08 -> 2022-07
  • sum crime instances across yearly period, grouped by borough and crime type
  • combine with population data and calculate crime rate per 1,000 people during the period
  • combine with deprivation data
  • explore relationships
In [ ]:
# df1 = crimes[(crimes['Month'] >= '2021-08') & (crimes['Month'] <= '2022-07')]
df1 = incidents[(incidents['Month'] >= '2021-08') & (incidents['Month'] <= '2022-07')]
In [ ]:
df2 = df1.groupby(['Borough'], as_index=False)['Crime Category'].value_counts()
In [ ]:
## load population data, merge, calculate crime rate
pop_df = pd.read_csv('external_data/Borough_pop_census2021.csv', index_col = 0)
df3 = pd.merge(df2, pop_df, on=['Borough'])
df3['Crime Rate'] = ((1000*df3['count']) / df3['population']).round(2)
df3.drop(['count'], axis=1, inplace=True)       ## drop count column as only need crime rate
In [ ]:
## convert from long-form to wide-form dataframe format
df4 = pd.pivot(df3, index=['Borough', 'population'], columns=['Crime Category'], values='Crime Rate')
df4.reset_index(inplace=True)
In [ ]:
## load deprivation data, clean, and combine with crime df
dep_df = pd.read_csv('external_data/Borough_deprivation.csv')

# convert to percentages
dep_df['Income deprivation rate (%)'] = dep_df['Income deprivation rate (%)'] * 100
dep_df['Deprivation gap (%)'] = dep_df['Deprivation gap (%)']*100
df5 = pd.merge(df4, dep_df, on='Borough')

## drop data for City of London -> exceptionally low population (roughly 8000) wildly distorts crime rates
df5 = df5[df5['Borough'] != 'City of London']

dep_df.head()
Out[ ]:
LAD code 2019 Borough Profile Deprivation gap (%) Deprivation gap ranking Moran's I Moran's I ranking Income deprivation rate (%) Income deprivation rate ranking Income deprivation rate quintile
0 E09000001 City of London Less income deprived 20.0 255 -0.15 316 7.0 280 5
1 E09000002 Barking and Dagenham More income deprived 25.0 195 0.27 175 19.0 20 1
2 E09000003 Barnet n-shape 32.0 132 0.36 105 11.0 148 3
3 E09000004 Bexley Flat 26.0 194 0.57 21 11.0 169 3
4 E09000005 Brent More income deprived 32.0 126 0.55 26 16.0 65 2

ONS definitions:

  • Deprivation rate measures the 'proportion of the population experiencing deprivation relating to low income'.
  • Moran's I is measured from -1 to +1, where +1 is highly clustered, and -1 is like a chessboard, with a completely uniform mix of high and low deprivation neighbourhoods.
  • Deprivation gap measures the difference between the most and least deprived neighbourhoods within that borough, to see which have the greatest gaps between the extremes (a small deprivation gap does not necessarily imply there is no deprivation; it may just mean income deprivation is evenly spread throughout the local authority rather than concentrated in a few neighbourhoods).

4.2a - Visualising the relationships with scatter plots¶

In [ ]:
# available crime types
'Anti-social behaviour', 'Bicycle theft', 'Burglary', 'Criminal damage and arson', 'Drugs', 'Other theft', 'Possession of weapons', 'Public order', 'Robbery', 'Shoplifting', 'Theft from the person', 'Vehicle crime', 'Violence and sexual offences', 'Other crime'
In [ ]:
drugs = alt.Chart(df5).mark_circle().encode(
    x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
    y = alt.Y('Drugs:Q', title=None),
    color = alt.Color('Borough:N', legend=None),
    size = 'population',
    tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Drugs:Q', title='Drug crime rate'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
    title = 'Drug offence crime rate (per 1,000 people)',
    width = 350
)

shoplifting = alt.Chart(df5).mark_circle().encode(
    x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
    y = alt.Y('Shoplifting:Q', title=None),
    color = alt.Color('Borough:N', legend=None),
    size = 'population',
    tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Shoplifting:Q', title='Shoplifting crime rate'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
    title = 'Shoplifting crime rate (per 1,000 people)',
    width = 350
)

violence = alt.Chart(df5).mark_circle().encode(
    x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
    y = alt.Y('Violence and sexual offences:Q', title=None),
    color = alt.Color('Borough:N', legend=None),
    size = 'population',
    tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Violence and sexual offences:Q', title='Violence and sexual offences'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
    title = 'Violence and sexual offences (per 1,000 people)',
    width = 350
)

## could be shortened by using .repeat() but lose ability to set unique titles.
In [ ]:
drugs | shoplifting | violence
Out[ ]:

This plots 3 crime rates against the 2019 income deprivation rate for each London borough. The deprivation rate is defined by the ONS as the 'proportion of the population experiencing deprivation relating to low income'.

As such, an upward trend without many outliers suggests that crime rates correlate with higher income deprivation.

Both drug offences and violence & sexual offences crime types show a clear upward correlation, while (perhaps surprisingly) shoplifting shows a relatively flat relationship with income deprivation.

Again, Westminster is a clear outlier with a near-average deprivation rate but a crime rate 2-4 times higher than any other borough - the high tourist and day population relative to the resident population is the most likely cause.

4.2b - How is income deprivation correlated across all crime types?¶

There is clearly some relationship (whether spurious or not) between income deprivation and crime rates, which motivates exploring it across all crime types.

In [ ]:
## create list of unique crime categories to iterate over
types = list(df2['Crime Category'].unique())
In [ ]:
corr_df = pd.DataFrame(columns=['Crime Type', 'Correlation', 'p-value'])
for crime in types:
    data1 = df5['Income deprivation rate (%)']
    data2 = df5[crime]
    # calculate Pearson's correlation
    corr, pvalue = pearsonr(data1, data2)
    row = [crime, corr, pvalue]
    corr_df.loc[len(corr_df)] = row
In [ ]:
## conditional cell highlighting function
def pvalue_highlight(row):
    val = row.loc['p-value']        # assign value from p-value column to test
    if val < 0.01:
        color = '#9080ff'       
    elif val < 0.05:
        color = '#776bcd'
    elif val < 0.1:
        color = '#48446e'
    else:
        color = ''
    return ['background-color: {}'.format(color) for r in row]

corr_df.style.apply(pvalue_highlight, axis=1).format('{:.4f}', subset=['Correlation','p-value'])
Out[ ]:
  Crime Type Correlation p-value
0 Violence and sexual offences 0.5320 0.0017
1 Anti-social behaviour 0.4683 0.0069
2 Vehicle crime 0.3740 0.0350
3 Other theft 0.1506 0.4107
4 Criminal damage and arson 0.4514 0.0095
5 Drugs 0.4454 0.0106
6 Public order 0.3503 0.0494
7 Burglary 0.4312 0.0137
8 Shoplifting -0.0066 0.9712
9 Robbery 0.3861 0.0291
10 Theft from the person 0.1772 0.3318
11 Other crime 0.0517 0.7786
12 Bicycle theft 0.2228 0.2203
13 Possession of weapons 0.5177 0.0024

The highlighted rows indicate a statistically significant correlation between income deprivation rate and the respective crime type. The lightest purple indicates significance at the 1% level (violence and sexual offences, anti-social behaviour, criminal damage and arson, possession of weapons), while the mid purple indicates significance at the 5% level (vehicle crime, drugs, public order, burglary, robbery). No category falls only within the 10% band (darkest purple).

So, we find statistically significant correlations between income deprivation rate and most crime types, with the following exceptions:

  • Other theft
  • Shoplifting
  • Theft from the person
  • Other crime
  • Bicycle theft

4.2c - Do other measures of inequality show statistical relationships with crime?¶

The ONS deprivation data contains other useful metrics that could factor into crime rates. Moran's I measures the extent to which deprivation is clustered. For instance, areas with high deprivation rates but low deprivation clustering may have lower crime rates than similarly deprived areas with high clustering. The deprivation gap measures the percentage difference between the most and least deprived neighbourhoods in the area and is thus an indication of local inequality.

Each of these three metrics is somewhat interlinked, but distinct enough that a combined model should not suffer significant multicollinearity - this can be checked manually with VIF.

First, we can explore the relationship with drug crime offences.

In [ ]:
## set independent and dep vars
X = df5[['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)']]
y = df5['Drugs']
In [ ]:
X = sm.add_constant(X)          # add an intercept: with all predictors at 0 we would still expect a non-zero crime rate
model = sm.OLS(y, X).fit()      # fit model
results = model.summary()       # summarise results
results
Out[ ]:
OLS Regression Results
Dep. Variable: Drugs R-squared: 0.315
Model: OLS Adj. R-squared: 0.241
Method: Least Squares F-statistic: 4.282
Date: Wed, 11 Jan 2023 Prob (F-statistic): 0.0131
Time: 16:03:24 Log-Likelihood: -65.450
No. Observations: 32 AIC: 138.9
Df Residuals: 28 BIC: 144.8
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -3.1381 2.497 -1.257 0.219 -8.252 1.976
Income deprivation rate (%) 0.1902 0.116 1.634 0.113 -0.048 0.429
Moran's I -3.2379 2.684 -1.207 0.238 -8.735 2.259
Deprivation gap (%) 0.2152 0.099 2.178 0.038 0.013 0.418
Omnibus: 44.567 Durbin-Watson: 1.369
Prob(Omnibus): 0.000 Jarque-Bera (JB): 195.795
Skew: 2.919 Prob(JB): 3.05e-43
Kurtosis: 13.619 Cond. No. 264.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression results are somewhat surprising, in that both income deprivation rate and Moran's I have p-values above 0.1, and thus are shown to be poor predictors of drug crime rate.

Conversely, deprivation gap is statistically significant at the 5% level with a coefficient of 0.215. This suggests that for every 1-point rise in the deprivation gap (ie disparities between neighbourhoods and concentrations of deprivation), the drug crime rate can be expected to rise by 0.215 per 1,000 people.

It's worth noting that the adjusted r-squared value is relatively low at 0.241, suggesting only around 24% of the variation in drug crime rates is explained by our independent variables.
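To make the interpretation concrete, a quick hand calculation using the fitted deprivation gap coefficient from the summary above (the 5-point rise is a hypothetical scenario):

```python
coef_gap = 0.2152   # fitted 'Deprivation gap (%)' coefficient from the OLS summary
delta_gap = 5       # hypothetical 5-point rise in the deprivation gap

# predicted change in drug crime rate, per 1,000 people
print(round(coef_gap * delta_gap, 3))   # 1.076
```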

In [ ]:
## check multicollinearity 
vif = pd.DataFrame()
vif['VIF factor'] = [variance_inflation_factor(X.values, i) for i in range(X.values.shape[1])]
vif["features"] = X.columns
print(vif.round(1))
   VIF factor                     features
0        49.9                        const
1         1.2  Income deprivation rate (%)
2         1.5                    Moran's I
3         1.6          Deprivation gap (%)

High multicollinearity between independent variables inflates standard errors and thus distorts parameter estimates. All VIF factors for the three predictors are close to 1 (the constant's VIF can be ignored), so we can conclude the deprivation metrics have little correlation with each other - i.e. sufficiently low that we can accept the inference of the results.

Also, from the scatter plots above, the variance between boroughs seems consistent across deprivation rates, so we would not expect heteroskedasticity to be a problem.

4.2d - Extending the regression model to all crime types¶

We can apply the same methodology to each crime type, using a for loop to fit the model between each type and the three deprivation measures.

To display the data together intuitively: first a multi-index dataframe is created so the coefficient and p-value of each explanatory variable can be displayed side by side; then all the data is added in the same for loop that fits the models; finally a function is applied that conditionally highlights cells - applying this to individual cells in a multi-index dataframe proved especially tricky, hence the long function.

In [ ]:
headers = [
    np.array(['Crime Type', 'Income deprivation rate (%)', 'Income deprivation rate (%)', "Moran's I", "Moran's I", 'Deprivation gap (%)', 'Deprivation gap (%)']),
    np.array(['', 'coef', 'p-value', 'coef', 'p-value', 'coef', 'p-value']),
]       # set headers for multi-index dataframe
regress_df = pd.DataFrame(columns=headers)
X = df5[['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)']]
X = sm.add_constant(X)      ## add constant to the model (once, outside the loop)
for crime in types:
    y = df5[crime]
    model = sm.OLS(y, X).fit()
    row = [crime, model.params[1].round(4), model.pvalues[1].round(4), model.params[2].round(4), model.pvalues[2].round(4), model.params[3].round(4), model.pvalues[3].round(4)]
    regress_df.loc[len(regress_df)] = row
In [ ]:
## conditional cell highlighting function
def pvalue_highlight(row):
    s10 = 'background-color: #48446e'
    s5 = 'background-color: #776bcd'
    s1 = 'background-color: #9080ff'
    default = ''

    if row['Income deprivation rate (%)']['p-value'] < 0.1: 
        I = s10
        if row['Income deprivation rate (%)']['p-value'] < 0.05:  
            I = s5
            if row['Income deprivation rate (%)']['p-value'] < 0.01: 
                I = s1
    else: I = default

    if row["Moran's I"]['p-value'] < 0.1: 
        M = s10
        if row["Moran's I"]['p-value'] < 0.05:  
            M = s5
            if row["Moran's I"]['p-value'] < 0.01: 
                M = s1
    else: M = default

    if row['Deprivation gap (%)']['p-value'] < 0.1: 
        D = s10
        if row['Deprivation gap (%)']['p-value'] < 0.05:  
            D = s5
            if row['Deprivation gap (%)']['p-value'] < 0.01: 
                D = s1
    else: D = default

    return [default, I, default, M, default, D, default]        ## only highlight coef. columns

## for each row apply the function pvalue_highlight.
regress_df.style.apply(pvalue_highlight, axis=1).format('{:.4f}', subset=['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)'])
Out[ ]:
  Crime Type Income deprivation rate (%) Moran's I Deprivation gap (%)
  coef p-value coef p-value coef p-value
0 Violence and sexual offences 0.8698 0.0249 -4.1379 0.6287 0.7615 0.0211
1 Anti-social behaviour 0.9329 0.0864 -11.1572 0.3645 1.0995 0.0200
2 Vehicle crime 0.2160 0.1204 4.3685 0.1710 0.2684 0.0264
3 Other theft -0.1163 0.8774 -11.0515 0.5265 1.4871 0.0264
4 Criminal damage and arson 0.1062 0.1027 -0.8733 0.5524 0.1345 0.0179
5 Drugs 0.1902 0.1135 -3.2379 0.2377 0.2152 0.0380
6 Public order 0.1368 0.3689 -3.0731 0.3811 0.3383 0.0128
7 Burglary 0.1204 0.1961 -2.2737 0.2873 0.2758 0.0013
8 Shoplifting -0.1433 0.3306 -0.5314 0.8746 0.2971 0.0224
9 Robbery 0.1478 0.2594 -2.8326 0.3468 0.2865 0.0138
10 Theft from the person 0.0145 0.9840 -15.1994 0.3664 1.3242 0.0385
11 Other crime 0.0292 0.4767 1.2377 0.1952 -0.0330 0.3445
12 Bicycle theft 0.0021 0.9843 -7.0480 0.0067 0.2389 0.0117
13 Possession of weapons 0.0260 0.0404 -0.4068 0.1560 0.0238 0.0279

The results are consistent with the initial drug crime model: income deprivation rate is not a good predictor on its own, and the strongly significant correlations found previously are likely evidence of omitted variable bias.

Instead, with all three metrics considered, deprivation gap (%) outperforms income deprivation rate and Moran's I in predicting crime rates. This suggests that, in general, boroughs with the most extreme divides between poor and rich areas can expect to see higher crime rates. This is true irrespective of the more general rate of deprivation, except for violence and sexual offences, possession of weapons, and ASB, which correlate with both the income deprivation rate and gap.

This could be interpreted in many ways; for instance, those suffering from income deprivation may be more aggrieved by their situation if living near the most affluent - although Moran's I is not a good estimator here.

These results serve to justify that the link between inequality and crime is complex - i.e. lower incomes alone cannot be said to increase crime.


4.3. How did the pandemic, and government imposed lockdowns, affect crime patterns?¶

4.3a - What was the effect of the lockdowns on the number of crimes?¶

In [ ]:
df3a = incidents.groupby(['Month'], as_index=False)['Crime Category'].value_counts()
In [ ]:
## Sum the total number of crimes committed by month and crime type
alt.Chart(incidents.groupby(['Month'], as_index=False)['Crime Category'].value_counts()).mark_area().encode(
    x = alt.X('Month:T', title=None),
    y = alt.Y('count:Q', title=None),
    color = "Crime Category:N"
).properties(
    width = 700,
    title = 'London: monthly crime incidents by type'
)
Out[ ]:

Aside from anti-social behaviour, this stacked area chart shows clear dips during the lockdown periods: first around April 2020, and again towards the end of 2020 with lockdown 2 (November 2020) and lockdown 3 from the start of January 2021. Breaches of Covid restrictions were generally recorded as anti-social behaviour, which explains the sharp rise in ASB. Interestingly, this rise was less pronounced in lockdowns 2 and 3, suggesting some change either in behaviour or policing.

However, how this was spread among different crime types is still unclear.

4.3b - Which types of crime were most affected by lockdowns?¶

For this, we consider the first lockdown period from April to June 2020, as well as the three-month periods either side of it.

In [ ]:
## create filtered dataframes for each period
preL1 = crimes[(crimes['Month'] >= '2020-01') & (crimes['Month'] <= '2020-03')]
L1 = crimes[(crimes['Month'] >= '2020-04') & (crimes['Month'] <= '2020-06')]
postL1 = crimes[(crimes['Month'] >= '2020-07') & (crimes['Month'] <= '2020-09')]
In [ ]:
df_preL1 = pd.DataFrame(preL1['Crime Category'].value_counts()).reset_index()
df_preL1['Period'] = 'Jan-Mar'

df_L1 = pd.DataFrame(L1['Crime Category'].value_counts()).reset_index()
df_L1['Period'] = 'Apr-Jun'

df_postL1 = pd.DataFrame(postL1['Crime Category'].value_counts()).reset_index()
df_postL1['Period'] = 'Jul-Sep'
In [ ]:
periods_li = [df_preL1, df_L1, df_postL1]
df3a = pd.concat(periods_li, ignore_index=True)
df3a.columns = ['Crime Category', 'Crime Total', 'Period']
In [ ]:
alt.Chart(df3a).mark_bar().encode(
    x = alt.X('Period:O', title=None, sort=['Jan-Mar', 'Apr-Jun', 'Jul-Sep']),
    y = alt.Y('Crime Total:Q'),
    color = alt.Color('Period:N'),
    column = alt.Column('Crime Category:N', title='London: effects of the first Covid-19 lockdown (2020)'),
    tooltip = [alt.Tooltip('Period:N'), alt.Tooltip('Crime Total:Q', format=',')]
)
Out[ ]:
In [ ]:
df3a.head()
Out[ ]:
Crime Category Crime Total Period
0 Violence and sexual offences 56160 Jan-Mar
1 Vehicle crime 31759 Jan-Mar
2 Other theft 28134 Jan-Mar
3 Burglary 18510 Jan-Mar
4 Theft from the person 14908 Jan-Mar

This shows crimes of most types falling or staying level during the first Covid lockdown, with theft-related crimes falling the most.

Bicycle theft and drug offences are the only categories to show any noticeable increase, although the rise in bicycle theft continues sharply into the three months following, suggesting this could be a seasonal effect. The sharp rise (roughly 30%) and subsequent fall in drug offences either side of the lockdown period suggests drug behaviour may have been considerably affected. However, this again may not imply a direct relationship (i.e. that isolation increased drug taking), and could result from other effects on policing, such as how easily offenders could be tracked.

Violence and sexual offences remained at a similar level, with a sharp rise following the easing of restrictions. Many of its sub-categories (for which data is not available), such as minor assault and harassment, could be expected to decrease considerably under lockdown. The lack of any significant change could be due to other offence types within the category increasing during lockdown, such as domestic crimes.
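The percentage changes discussed in this section can be made explicit by pivoting the period totals and comparing the lockdown quarter against the preceding one. A minimal sketch, using a small synthetic stand-in for the `df3a` built above (same columns: `Crime Category`, `Crime Total`, `Period`; the figures are illustrative):

```python
import pandas as pd

## synthetic stand-in for df3a, with illustrative totals
df3a_demo = pd.DataFrame({
    'Crime Category': ['Drugs', 'Drugs', 'Drugs', 'Burglary', 'Burglary', 'Burglary'],
    'Crime Total': [4000, 5200, 4100, 18000, 12000, 14000],
    'Period': ['Jan-Mar', 'Apr-Jun', 'Jul-Sep'] * 2,
})

## one row per category, one column per period
wide = df3a_demo.pivot(index='Crime Category', columns='Period', values='Crime Total')

## % change during lockdown relative to the preceding quarter
wide['Lockdown change (%)'] = (100 * (wide['Apr-Jun'] - wide['Jan-Mar']) / wide['Jan-Mar']).round(1)
print(wide[['Lockdown change (%)']])
```

Sorting the resulting column would rank categories by how strongly the lockdown affected them.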


4.3c - Was there a geographical change in criminal activity?¶

Using a hexbin plot with matplotlib, the individual crime incidents can be binned and then displayed as a choropleth-style map to identify crime hotspots.

In [ ]:
fig, ax = plt.subplots(1,2, figsize=(18,6), sharey=True)
ax[0].hexbin(x=crimes[crimes['Month'] == '2020-04']['Longitude'], y=crimes[crimes['Month'] == '2020-04']['Latitude'])
ax[1].hexbin(x=crimes[crimes['Month'] == '2022-04']['Longitude'], y=crimes[crimes['Month'] == '2022-04']['Latitude'])

plt.show()

Plotting every crime incident for the first full lockdown month (April 2020) alongside April 2022 shows a marked change in geographic distribution.

The lockdown plot better resembles a standard population-density visualisation, with dark areas for London's larger parks, rivers and reservoirs. This suggests that the crimes which did still occur during lockdown were likely committed closer to the perpetrator's residence.

This could partially support the idea that the policing of certain offences, such as drug crime, became easier during lockdown and thus explained the rise (or lack of fall) of incidents.
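One caveat with these side-by-side hexbins is that each panel auto-scales its own colour map, which can exaggerate (or mask) the contrast between the two months. A sketch of forcing a shared colour scale, using synthetic coordinates in place of the crime data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')   # non-interactive backend for this sketch
import matplotlib.pyplot as plt

## synthetic incident coordinates standing in for two months of crimes
rng = np.random.default_rng(1)
lon_a, lat_a = rng.normal(-0.1, 0.10, 2000), rng.normal(51.5, 0.10, 2000)
lon_b, lat_b = rng.normal(-0.1, 0.15, 5000), rng.normal(51.5, 0.15, 5000)

fig, ax = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
hb_a = ax[0].hexbin(lon_a, lat_a, gridsize=40)
hb_b = ax[1].hexbin(lon_b, lat_b, gridsize=40)

## clamp both panels to one colour scale so densities are directly comparable
vmax = max(hb_a.get_array().max(), hb_b.get_array().max())
for hb in (hb_a, hb_b):
    hb.set_clim(0, vmax)
fig.colorbar(hb_b, ax=ax, label='Incidents per hexagon')
```

With a shared scale, a panel that looks "darker" genuinely has fewer incidents rather than just a different normalisation.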

4.3d - Have lockdowns caused any structural changes in crime?¶

For this, we can look at how crime rates 'recovered' after lockdown and whether this was distributed evenly.

First, we must find the monthly crime totals per borough, merge these with the population data, and then calculate crime rates. We can then filter for our chosen Covid period of January 2020 to June 2021.

In [ ]:
## calculate monthly crime total per borough
m_count = crimes.groupby(['Borough', 'Month'], as_index=False)['Month'].value_counts()

## merge with population data and calculate crime rates
m_rate = pd.merge(m_count, pop_df, on=['Borough'])
m_rate['Crime Rate'] = ((1000*m_rate['count']) / m_rate['population']).round(2)

## filter df to remove City of London crimes
df_f = m_rate[(m_rate['Borough'] != 'City of London')]
# m_rate.drop(['count'], axis=1, inplace=True)
In [ ]:
## filter dataframe for covid period: January 2020 -> June 2021
covid = df_f[(df_f['Month'] >= '2020-01') & (df_f['Month'] <= '2021-06')].reset_index(drop=True)

To evaluate whether pandemic effects have been equally distributed, we can consider a starting point from which each borough is equal. This starting point can be the index from which all later values are measured.

  • Any structural effects are most likely to exhibit at the extremes, so we can rank the boroughs by income deprivation and then extract the most and least deprived.
  • For this, we will iterate through the boroughs, then take a borough-specific subset of the full dataframe, to which the indexing calculation can be applied (dividing every subsequent month's value by that first value).
  • Lastly, merge the crime and deprivation data, and filter the dataframe so it only contains the boroughs in the most and least deprived subsets.
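The cells below apply the indexing borough by borough with a loop; the same calculation can also be expressed in one step with `groupby(...).transform('first')`. A minimal sketch on toy data (column names match those used above):

```python
import pandas as pd

## toy monthly crime-rate data for two boroughs
demo = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Month': ['2020-01', '2020-02', '2020-03'] * 2,
    'Crime Rate': [8.0, 6.0, 4.0, 10.0, 15.0, 5.0],
})

## index each borough's rate to 100 in its first month
demo['Index'] = 100 * demo['Crime Rate'] / demo.groupby('Borough')['Crime Rate'].transform('first')
```

`transform('first')` broadcasts each borough's first-month rate back over all its rows, so no explicit loop or concatenation is needed.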
In [ ]:
## sort deprivation data and remove City of London
f1 = dep_df.sort_values(by=['Income deprivation rate (%)'], ascending=False, ignore_index=True)
f2 = f1[(f1['Borough'] != 'City of London')]

## creates a list of most and least deprived boroughs
high_dep = list(f2.head(8)['Borough'])
low_dep = list(f2.tail(8)['Borough'])
In [ ]:
## create list of unique boroughs to iterate over
boroughs = list(covid['Borough'].unique())
## calculate index for crime rates = 100 at t1
index = []
for b in boroughs:
    subset = covid[(covid['Borough'] == b)].reset_index(drop=True)
    base = subset['Crime Rate'].iloc[0]      ## first month's rate (i.e. January 2020 value)
    subset['Rate_indexed'] = (subset['Crime Rate'] / base) * 100
    index.extend(list(subset['Rate_indexed']))

covid['Index'] = index

# merge with deprivation data
covid_df = pd.merge(covid, dep_df, on='Borough')

# filter for most and least deprived
tails_covid_df = covid_df[(covid_df['Borough'].isin(low_dep)) | (covid_df['Borough'].isin(high_dep))]
In [ ]:
alt.Chart(tails_covid_df).mark_line(point='transparent').encode(
    x = alt.X('Month:O', title=None, axis=alt.Axis(labelAngle=-30, labelOffset=12)),
    y = alt.Y('Index:Q', title=None, scale=alt.Scale(domain=(20, 140))),
    detail = alt.Detail('Borough:N'),
    color = alt.Color('Income deprivation rate (%):Q', scale=alt.Scale(scheme='blueOrange')),
    tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Month:N'), alt.Tooltip('Crime Rate:Q', format='.3', title='Crime Rate (per 1,000 people)'), alt.Tooltip('Income deprivation rate (%):Q'), alt.Tooltip('Index:Q', format='.3')]
).properties(
    title = 'Average monthly crime rate during the pandemic',
    width = 600,
    height = 320
)
Out[ ]:

This graph shows the development of crime rates for a subset of London boroughs (the 8 most and 8 least income deprived). The initial Covid/lockdown shock is clear between March and April 2020, with crime rates falling by between 15% and 45%. Crime rates bottom out again in January 2021, the first month of the third lockdown.

By incorporating a diverging colour scale across the boroughs, it's clear that the effect on crime rates is unequally distributed. Boroughs with higher income deprivation (the orange lines) typically saw a smaller reduction in crime rates during lockdowns. This continued in the 'return-to-normal' periods, with high-deprivation areas returning to near or above pre-pandemic levels within 2-3 months of the 1st and 3rd lockdowns, while low-deprivation areas settled at crime rates roughly 10% (0-20%) below pre-pandemic levels.

Can we see these changes in the periods preceding and following the pandemic?¶

To consider any differences in impact on trends, we can compare the 6-month period leading up to the first lockdown with the 6-month period following the lifting of all Covid restrictions.

In [ ]:
## filter new dataframes for before and after covid period
pre_covid = df_f[(df_f['Month'] >= '2019-10') & (df_f['Month'] <= '2020-03')].reset_index(drop=True)
post_covid = df_f[(df_f['Month'] >= '2021-07') & (df_f['Month'] <= '2021-12')].reset_index(drop=True)
In [ ]:
# calculates the monthly mean crime rate across the 6 month period, and merges with deprivation data
pre_df = pd.merge(pre_covid.groupby('Borough', as_index=False)['Crime Rate'].mean(), dep_df, on='Borough')
post_df = pd.merge(post_covid.groupby('Borough', as_index=False)['Crime Rate'].mean(), dep_df, on='Borough')

We can use the lists of most and least deprived boroughs to filter the dataset.

In [ ]:
low_pre_df = pre_df[(pre_df['Borough'].isin(low_dep))]
high_pre_df = pre_df[(pre_df['Borough'].isin(high_dep))]

low_post_df = post_df[(post_df['Borough'].isin(low_dep))]
high_post_df = post_df[(post_df['Borough'].isin(high_dep))]

Lastly, calculate the mean crime rate among the most and least deprived boroughs.

In [ ]:
low_diff = (low_post_df['Crime Rate'].mean() - low_pre_df['Crime Rate'].mean())/low_pre_df['Crime Rate'].mean()
high_diff = (high_post_df['Crime Rate'].mean() - high_pre_df['Crime Rate'].mean())/high_pre_df['Crime Rate'].mean()
In [ ]:
print(f'Change in crime rate amongst most deprived boroughs: {high_diff: .2%}')
print(f'Change in crime rate amongst least deprived boroughs: {low_diff: .2%}')
Change in crime rate amongst most deprived boroughs: -0.71%
Change in crime rate amongst least deprived boroughs: -6.86%

This shows the difference in average monthly crime rate between the 6-month periods prior to and following the pandemic period (Apr 2020 - Jun 2021).

The most income-deprived boroughs have seen a negligible change in crime rates, while the least income-deprived have seen an almost 7% reduction in crime. This decrease could be representative of a long-term trend and thus not a direct result of the pandemic. However, the failure of the most deprived boroughs to match this trend strongly suggests an unequal distribution of detrimental effects resulting from the pandemic (and its associated impacts on wellbeing, livelihoods etc.).

Whatever factors underlie this difference, structurally different areas will always vary in their sensitivity to exogenous shocks. This could also illustrate how nominally equal treatment (e.g. through income support schemes) can in practice be regressive.



Flaws in analysis:

  • A constant population (the 2021 Census estimates) is assumed across the whole period. If the population was in fact increasing, crime rates would be slightly suppressed before, and inflated after, the point the population figure is accurate to. The period is relatively short, so this shouldn't be too influential; also, 2011 Census data has often been used for crime rates, so using the 2021 estimates still serves as an improvement.
  • The rates also do not account for the larger daytime populations of the inner London boroughs.
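One way to soften the constant-population assumption would be to interpolate borough populations linearly between the 2011 and 2021 Census figures, then use each year's estimate as the crime-rate denominator. A sketch with hypothetical population figures:

```python
import numpy as np

## hypothetical borough populations at the two census points
pop_2011, pop_2021 = 300_000, 330_000

## linearly interpolate a population estimate for each year of the study period
years = np.arange(2019, 2023)
pop_est = np.interp(years, [2011, 2021], [pop_2011, pop_2021])
## note: 2022 falls outside the range, so np.interp holds the 2021 value
```

This would still miss within-year change and daytime-population effects, but removes the step change from switching census years.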


5. Conclusion¶

4.1 Are crime rates decreasing?¶
  • We found that crime rates, while steady overall (with the exception of some lockdown effects), can vary significantly by crime type and by location.
4.2 Is there a difference between crime types of the poorest and richest boroughs?¶
  • Next, we found a clear statistical difference in the types of crime committed when comparing boroughs with varying levels of inequality. The deprivation gap (%) was found to be the most significant predictor.
4.3 How did the pandemic, and government imposed lockdowns, affect crime patterns?¶

Finally, we found that lockdowns disproportionately affected different London boroughs, with the richest and poorest boroughs showing a clear divergence in crime rates.

Further ideas:

Another interesting dataset covers police stop and search. This is again recorded with (rough) coordinate locations, but also includes demographic features. One potential idea is to look at the clustering of crimes and see whether stop and search corresponds to it: i.e. how targeted is stop and search?
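As a starting point, stop-and-search records can be pulled from the `stops-street` endpoint of the same API. The helper below only constructs the request URL (the coordinates and month are illustrative); the parameter details should be checked against the data.police.uk documentation before relying on it:

```python
## build a stops-by-area request for data.police.uk (1-mile radius around a point)
def stops_street_url(lat, lng, month):
    return f'https://data.police.uk/api/stops-street?lat={lat}&lng={lng}&date={month}'

url = stops_street_url(51.5074, -0.1278, '2022-04')   # central London, April 2022
## network call omitted in this sketch:
# import requests
# stops = requests.get(url).json()   # list of records incl. demographic fields
```

The same polygon-based querying used for the crime data should also carry over, which would let stop-and-search locations be compared against the crime hotspots found above.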